MATEO: intermolecular α-amidoalkylation theoretical enantioselectivity optimization. Online tool for selection and design of chiral catalysts and products

The enantioselective Brønsted acid-catalyzed α-amidoalkylation reaction is a useful procedure for the production of new drugs and natural products. In this context, Chiral Phosphoric Acid (CPA) catalysts are versatile catalysts for this type of reaction. The selection and design of new CPA catalysts for different enantioselective reactions is of dual interest because both new CPA catalysts (tools) and chiral drugs or materials (products) can be obtained. However, this process is difficult and time consuming if approached from an experimental trial-and-error perspective. In this work, a Heuristic Perturbation-Theory and Machine Learning (HPTML) algorithm was used to seek a predictive model of CPA catalyst performance, in terms of enantioselectivity, in α-amidoalkylation reactions, with R² = 0.96 overall for training and validation series. It involved Monte Carlo sampling of > 100,000 pairs of query and reference reactions. In addition, the computational and experimental investigation of a new set of intermolecular α-amidoalkylation reactions using BINOL-derived N-triflylphosphoramides as CPA catalysts is reported as a case study. The model was implemented in a web server called MATEO: InterMolecular Amidoalkylation Theoretical Enantioselectivity Optimization, available online at: https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo. This new user-friendly online computational tool would enable sustainable optimization of reaction conditions that could lead to the design of new CPA catalysts along with new organic synthesis products.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-024-00802-7.


Introduction
Chiral Phosphoric Acid (CPA) and related catalysts are widely recognized, versatile tools in catalysis and organic synthesis, useful for the synthesis of chiral drug products [1][2][3]. The selection and design of new CPA catalysts for different enantioselective reactions is of dual interest because both new CPA catalysts (tools) and chiral drugs or materials (products) can be obtained [4]. However, this process is difficult and time consuming if approached from an experimental trial-and-error perspective. Quantum Computational Chemistry tools may help to unravel reaction mechanisms and to design new CPA catalysts [5,6]. Unfortunately, these techniques are less useful when a fast scanning/optimization of new CPA catalysts is needed for large libraries of reactions with diverse substrates, nucleophiles, products, and conditions (temperature, time, catalyst load, etc.). Cheminformatics methods relying upon Artificial Intelligence/Machine Learning (AI/ML) algorithms could help to speed up the discovery of new molecules [7][8][9] and the design of new chiral catalysts and products without engaging in a long-term empirical or quantum investigation [10][11][12][13]. Therefore, there is a need to develop fast-track computational tools able to predict the enantiomeric excess, saving time and experimental resources. However, the application of AI/ML techniques to the study of enantioselective reactions is still uncommon due to the inherent complexity of the problem. In addition, most models are not implemented in public online web servers or are otherwise not available to researchers or companies. In this context, the platform of Sigman et al. for CPA catalyst and organophosphorus ligand design is remarkable [14,15]. In these works, the authors predict reactivity using structural information of the query reactants/products. However, useful experimental/operational conditions of already known reference reactions similar to the query reaction are not considered. Recently, our group has addressed this problem by introducing the Perturbation-Theory and Machine Learning (PTML) approach, which employs as inputs both vectors of structural variables $D_{kqi}$ and vectors of multiple experimental conditions $c_{qj}$. These PTML algorithms have been applied in medicinal chemistry, vaccine design, nanotechnology, and catalysis [16][17][18][19][20][21].
In fact, we have previously reported a preliminary PTML model for the design of CPA catalysts for intermolecular α-amidoalkylation reactions [22]. However, the model was not implemented on a public online web server and is difficult for an experimentalist to use.
Consequently, in this work, we focus on the development of a public web server for the selection and design of CPA catalysts for enantioselective intermolecular α-amidoalkylation reactions (Scheme 1). In these reactions, the protonation of an α-hydroxylactam by the CPA would give a chiral conjugate base/N-acyliminium ion pair, which would be trapped by a nucleophile enantioselectively, generating a new tertiary or quaternary stereocenter [23,24]. The α-amidoalkylation reaction of aromatic systems using N-acyliminium ions as electrophiles is a Friedel-Crafts-type reaction that has found widespread application in organic synthesis for the production of new drugs and natural products [25,26]. For example, we have applied the procedure to the enantioselective synthesis of nuevamine-type alkaloids. Thus, indole and acyl moieties can be easily introduced in the alpha position of the nitrogen atom using a sterically demanding BINOL-derived CPA catalyst [27]. However, the enantioselectivity of these CPA-catalyzed reactions is sensitive to many factors, from the nature of the nucleophile and the catalyst to the experimental conditions (solvent, temperature, etc.). In this context, many efforts have been made to understand the role of non-covalent interactions in organocatalyzed reactions and to rationalize and predict their stereochemical outcome using Quantum Chemical methods [28][29][30]. However, the chemical space accessible by organic synthesis is very wide, and all compatible combinations of substrate, nucleophile, catalyst, and solvent would have to be scanned.
Therefore, the use of Cheminformatics models to explore the chemical space of these reactions becomes a very interesting option for reducing costs and time. Consequently, we decided to develop a new user-friendly online computational tool able to carry out screenings of this CPA-catalyzed intermolecular α-amidoalkylation reaction space for a large number of chiral catalysts, substrates, nucleophiles, solvents, chiral products, and reaction conditions. First, we carried out a re-evaluation of all the available data in our record to obtain a better estimate of the chemical space of these reactions. Next, we developed a new PTML model using Heuristics and Monte Carlo sampling, without relying on costly computational calculations. This PTML model was able to predict the enantioselectivity with R² = 0.96 after a comparative study of 332 reactions, which can be paired in > 100,000 ways, as each reaction can be a query or a reference reaction.
Later, we developed the web server called MATEO (interMolecular Amidoalkylation Theoretical Enantioselectivity Optimization), which is available at the online platform CPTMLTool (https://cptmltool.rnasa-imedir.com/). Finally, we have illustrated the practical use of the online tool with the experimental-theoretical study of a new set of CPA-catalyzed α-amidoalkylation reactions starting from bicyclic α-hydroxylactams 1 to construct the isoindoloisoquinoline framework 2 with a quaternary stereocenter. Electron-rich heteroaromatics (indole and pyrrole derivatives) 3 were used as nucleophiles and chiral BINOL-derived N-triflylphosphoramides 4 as catalysts (Scheme 2). This new tool may help experimentalists in organic, medicinal, and materials chemistry to explore the chemical space of CPA-catalyzed α-amidoalkylation reactions and to optimize the reaction conditions for practical purposes.
Scheme 1 General scheme for CPA-catalyzed intermolecular α-amidoalkylation reactions

Dataset and parameter studied
In this paper, we study the enantiomeric excess parameter $ee_R(\%)_{obs}$ in intermolecular α-amidoalkylation reactions. The value $ee_R(\%)_{obs}$ quantifies the enantiomeric excess referred to an (R)-catalyst. This parameter is defined as $ee_R(\%)_{obs} = \mathrm{Sign(Prod)} \cdot \mathrm{Sign(Cat)} \cdot ee(\%)_{obs}$, where Sign(Prod) = 1 for an (R)-product and Sign(Prod) = −1 for an (S)-product, with the R or S notation of the experimentally obtained products assigned according to the Cahn-Ingold-Prelog (CIP) rules [31]. The function Sign(Cat) = 1 for all reactions carried out with an (R)-catalyst, irrespective of the product obtained. On the other hand, for reactions carried out with an (S)-catalyst, Sign(Cat) was switched from +1 to −1, which reverses the sign contributed by Sign(Prod). This operation transforms (S)-catalyst reactions into (R)-catalyst reactions with the same absolute value of enantiomeric excess but the opposite sign of $ee_R(\%)_{obs}$, because a reaction is expected to give the same result with inverted configuration when the chirality of the catalyst is changed. Consequently, all reactions were studied as if they had been performed with an (R)-catalyst, keeping the (R)-catalyst when it was originally used or switching the signs for (S)-catalyst reactions. In practice, this procedure allows us to omit chiral molecular descriptors for substrates, products, catalysts, etc., because all the chirality information is included in the $ee_R(\%)$ terms for the query or reference reactions (see next sections). The method works properly in this specific case because all the reactions give products with only one stereogenic center, so both sides of the equation carry all the necessary chirality information.
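The sign convention above can be sketched in a few lines of Python. This is only an illustration under one assumed reading of the text, namely that the signed value is the plain product Sign(Prod)·Sign(Cat)·ee(%) with the labels taken as reported; the function name `ee_r_obs` and the example entries are hypothetical and do not come from the dataset.

```python
def ee_r_obs(ee_obs: float, product_config: str, catalyst_config: str) -> float:
    """Return the signed ee referred to an (R)-catalyst.

    Assumed reading of the text: ee_R = Sign(Prod) * Sign(Cat) * ee_obs,
    with Sign = +1 for an (R) label and -1 for an (S) label.
    """
    signs = {"R": 1, "S": -1}
    if product_config not in signs or catalyst_config not in signs:
        raise ValueError("configuration labels must be 'R' or 'S'")
    return signs[product_config] * signs[catalyst_config] * ee_obs


# Hypothetical entries: (ee observed in %, product label, catalyst label).
examples = [(90.0, "R", "R"), (90.0, "S", "R"), (90.0, "R", "S"), (90.0, "S", "S")]
for ee, product, catalyst in examples:
    value = ee_r_obs(ee, product, catalyst)
    print(f"({catalyst})-catalyst, ({product})-product, ee = {ee}% -> ee_R = {value}")
```

Under this reading, an (S)-catalyst reaction giving the (R)-product is stored with a negative value, that is, as the (S)-product that the (R)-catalyst would be expected to give.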

Reaction condition variables
Apart from the molecular descriptors, we also consider reaction condition variables $V_k(c_{qi})$ as inputs, each quantifying a k-th property (k = 1, 2, 3) related to a general reaction condition ($c_q$) and/or a specific reactant. In this chemical reaction dataset, the variables taken into account for the i-th query reaction were: $V_1(c_{qi})$ = T (°C) = temperature, $V_2(c_{qi})$ = t (h) = reaction time, and $V_3(c_{qi})$ = L (%) = catalyst loading. By analogy, the variables considered for each j-th reference reaction were: $V_1(c_{rj})$ = T (°C) = temperature, $V_2(c_{rj})$ = t (h) = reaction time, and $V_3(c_{rj})$ = L (%) = catalyst loading.

Dataset studied, compounds and reactions notation
A dataset of 332 CPA-catalyzed enantioselective intermolecular α-amidoalkylation reactions has been compiled, comprising 324 reactions obtained from the literature (see Additional file 3) and 8 new reactions studied in this work for the first time (see Table 8). These reactions have been grouped into 34 families according to the different structural patterns of the substrates, nucleophiles, and catalysts. There are different types of substrates S (mostly cyclic and bicyclic α-hydroxylactams, but also 3-hydroxyindolines) that are reacted with different types of nucleophiles Nu (indoles, pyrroles, Hantzsch esters, enols, and enamides) using CPAs (phosphoric acids or the corresponding N-triflylphosphoramides and sulfonamides) as catalysts Cat.
All compounds have been labeled with a 5-element code Xyznn, where X = S for Substrates, X = Nu for Nucleophiles, and X = F (Family) for Catalysts; y is the structural family (a, b, c, …); z is the structural sub-family, if any (a, b, c, …); and nn is the ID number of the compound in the dataset. When the structural sub-family is missing, the label z is omitted from the notation. Then, a code was created to classify each reaction in the dataset into different reaction types based on the structure of the molecules involved. Thus, the values of the family label y of the Substrate, Nucleophile, and Catalyst were concatenated in this order to obtain the ID code of each reaction type. For example, the reaction of the Substrate S03aa with the Nucleophile Nua04 and the Catalyst Fab04 belongs to the reaction type with the ID code aaa. Scheme 3 shows selected examples of the reaction types included in the dataset, using different types of cyclic hydroxylactams as substrates (S03, S04, S06) and different nucleophiles, such as indoles (Nua) [32,33], enamides (Nuf) [34], or Hantzsch esters as reducing agents (Nuc) [35], with CPA catalysts (F). The full experimental details of each of the 324 reference reactions (substrate, nucleophile, catalyst, catalyst loading, product, solvent, temperature, time, yield, % ee) are included in the Supporting Information (Additional file 3), which also includes the SMILES code of the substrate, nucleophile, and catalyst in each case. To give a general view of the chemical space in the dataset, general schemes for all reactions in the reference dataset are included in the Supporting Information (Additional file 1: Schemes S1 to S9). The structures and codification of substrates (S), nucleophiles (Nu), and catalysts (Cat) are also included in the Supporting Information (Additional file 1).

Molecular descriptors calculation
First, the web tool MMDcalc was used to calculate the molecular descriptors $D_k(m_{sqi})_g$ and $D_k(m_{sri})_g$ of the molecules $m_{sqi}$ and $m_{sri}$ involved in the query and reference reactions [36].

ML linear model
In this section, the $D_k(m_{sqi})_g$ values were used to seek a linear ML model. It is worth mentioning that each entry line of the dataset denotes only one query reaction ($R_{qi}$). The enantiomeric excess $ee_R(\%)_{calcqi}$ of the query reaction ($R_{qi}$) was predicted using as inputs both the variables $V_k(c_{qi})$, which depend on the experimental conditions, and the molecular descriptors $D_k(m_{sqi})_g$ of the molecules involved in the reaction. With both sets of variables as inputs, we can seek a linear additive AI/ML model. When the additive linear hypothesis is correct, the equality $ee_R(\%)_{calcqi} \approx ee_R(\%)_{obsqi}$ holds. The general additive form of the AI/ML model to be developed is given in Eq. 1.

PTML linear model
The PTML model is a well-known approach that predicts the reactivity of a new case (reaction) by comparison with other known reactions. Our model provides as output $ee_R(\%)_{calcqi}$, which is calculated for a query reaction ($R_{qi}$) because the observed enantiomeric excess $ee_R(\%)_{rjobs} = ee_R(\%)_{refj}$ of a reaction ($R_{rj}$) used as reference is already known. For this reason, in the dataset used to train/validate the PTML model, each entry line corresponds to a pair of reactions, specifically a query reaction compared to a reference reaction ($R_{qi}$ vs. $R_{rj}$). The PTML linear model predicts $ee_R(\%)_{calcqi}$ starting from the experimental value $ee_R(\%)_{refj}$ of a reference reaction. The model then adds the influence of the variations (perturbations) in structural, operational, or experimental conditions of the query with respect to the reference reaction. We use PT Operators (PTOs) to quantify these perturbations. The PTOs are denoted $\Delta D_k(m_{sqi}, m_{srj})_g$ for structural variations and $\Delta V_k(c_{qi}, c_{rj})$ for variations in the experimental reaction conditions. The formulas of the PTML models used in this section are shown in Eqs. 3 and 4.
In this work, the linear additive model takes as inputs the reference function $ee_R(\%)_{robs}$ and two sets of PTOs, $\Delta V(c_{qi}, c_{rj})$ and $\Delta D(m_{sqi}, m_{srj})_g$. The reference function $ee_R(\%)_{robs}$ is equal to the observed enantiomeric excess ee(%) when the reference reaction used an (R)-catalyst. We developed two types of PTO to seek the PTML linear model. The first type is $\Delta V_k(c_{qi}, c_{rj}) = V_k(c_{qi}) - V_k(c_{rj})$. It accounts for the perturbations/deviations in the values of the k-th variables/conditions $V_k(c_{qi})$ of the query reaction with respect to the values of the same variables $V_k(c_{rj})$ for the reference reaction. The second type is $\Delta D_k(m_{sqi}, m_{srj})_g = [D_k(m_{sqi}) - D_k(m_{srj})]_g$. It accounts for the perturbations/deviations in the values of the molecular descriptors of the query molecules with respect to the reference molecules. The input variables for the reference reaction, $V_k(c_{rj})$, are each related to a k-th property (k = 1, 2, 3) describing general experimental conditions ($c_{rj}$) and/or specific reactants: $V_1(c_{rj})$ = T (°C) = temperature, $V_2(c_{rj})$ = t (h) = reaction time, and $V_3(c_{rj})$ = L (%) = catalyst loading, for the reference reaction ($R_{rj}$). The input variables $D_k(m_{sri})_g$ are the molecular descriptors of the k-th type for the i-th molecules ($m_{sri}$) of each type involved in the reference reaction ($R_{rj}$). Analogously to the query, the molecules taking part in the reference reaction are $m_{r1j}$ = Substrate$_{rj}$, $m_{r2j}$ = Nucleophile$_{rj}$, $m_{r3j}$ = Catalyst$_{rj}$, and $m_{r4j}$ = Solvent$_{rj}$. In addition, we use the same k types of molecular descriptors as for the query reaction: $D_1$ = Number of Valence Electrons (Zv), $D_2$ = Van der Waals Volume (Vvdw), $D_3$ = Sanderson Electronegativity (χ), $D_4$ = Polarizability (α), and $D_5$ = Electron Affinity (EA). Table 1 gives detailed information about all the PTOs used as input variables in the PTML models.
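As an illustration of the two PTO types, the following Python sketch computes the query-minus-reference differences for condition variables and for molecular descriptors. The `Reaction` container, the key names, and all numeric values are hypothetical and chosen only for the example; the real descriptors come from the descriptor server cited in the text and are defined per atom group g.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """One reaction: condition variables V_k and descriptors D_k per molecule role."""
    conditions: dict = field(default_factory=dict)    # e.g. {"T": 25.0, "t": 24.0, "L": 10.0}
    descriptors: dict = field(default_factory=dict)   # e.g. {("Catalyst", "chi"): 2.5}


def perturbation_operators(query: Reaction, reference: Reaction):
    """Return (delta_V, delta_D): query minus reference for every shared key."""
    delta_v = {k: query.conditions[k] - reference.conditions[k]
               for k in query.conditions if k in reference.conditions}
    delta_d = {k: query.descriptors[k] - reference.descriptors[k]
               for k in query.descriptors if k in reference.descriptors}
    return delta_v, delta_d


# Hypothetical numbers, not taken from the dataset.
query = Reaction({"T": 25.0, "t": 24.0, "L": 10.0}, {("Catalyst", "chi"): 2.5})
reference = Reaction({"T": -20.0, "t": 48.0, "L": 5.0}, {("Catalyst", "chi"): 2.0})
delta_v, delta_d = perturbation_operators(query, reference)
print(delta_v)
print(delta_d)
```

Keying the descriptors by (role, descriptor) mirrors the requirement that each query molecule is compared with the reference molecule playing the same role in the reaction.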

AI/ML vs. PTML linear model development
To seek the AI/ML and PTML linear models, we applied Multivariate Linear Regression (MLR) and Linear Neural Network (LNN) algorithms using the software STATISTICA [38]. In the PTML regression models, the observed (experimental) values of enantiomeric excess $ee_R(\%)_{obsqi}$ have to be fitted against multiple reference values $ee_R(\%)_{refj}$. This kind of regression can generate artifacts in the distribution of the data [39]. The parameters $a_{k,s}$, $b_{k,s,g}$, and $e_0$ are the coefficients of the model to be fitted by the AI/ML algorithms. The PTML linear regression models were fitted in the form presented in Eq. 5.

Table 1 Definition of variables used as inputs of the PTML model
[Table body not recovered. Recoverable headings: Experimental conditions ($c_q$); Perturbation operators (b); Type of operator]
a Molecules (m) involved in the reaction with distinguishable roles: $m_{qsi}$ = Substrate (Sub$_q$), Product (Prod$_q$), Nucleophile (Nuc$_q$), Catalyst (Cat$_q$), and Solvent (Solv$_q$)
b PTOs with formula $\Delta V(m_q, m_r)_g = [V(m_q)_g - V(m_r)]_g$. These PTOs measure the variation of the value of the molecular property/structural variable (V) in the query molecules $m_q$ with respect to the value for the molecule $m_r$ with the same role in the reference reaction. The values of $V_k(m_q)_g$ are average values of the properties $V_k$ = Sanderson Electronegativities (χ), Polarizabilities, etc., for all the atoms in the group g and all their neighboring atoms placed at a topological distance k ≤ 5. Consequently, these properties have been calculated for all the atoms in the molecule (Tot) or for subsets of atoms (group g). The groups of atoms studied are g = unsaturated carbons (C$_{uns}$), saturated carbons (C$_{sat}$), Heteroatoms (Het), and Heteroatoms non-Halogen (HetNoX)

HPTML linear model
The PTML linear model can give different outputs for the same query reaction depending on the reference reaction selected. Therefore, in this section we introduce different Heuristics (H) to define the best reference reaction or set of reference reactions. Specifically, we used the following two heuristics. The first heuristic ($H_1$) takes as the final predicted value $ee_R(\%)_{qrpred} = ee_R(\%)_{qrmin}$, the value obtained using as reference the reaction with the minimum (Min) value of the PTOs, in other words, the minimal deviation. Specifically, $H_1$ uses as reference the reaction with the minimal difference/deviation (Δ) in the input variables $\Delta V(m_{qsi}, m_{rsj})$ and $\Delta V(c_{qi}, c_{rj})$ over all (∀) pairs of reactions. The second heuristic ($H_2$) takes $ee_R(\%)_{qrpred} = ee_R(\%)_{qravg} = \mathrm{Avg}(ee_R(\%)_{qrcalc})$. $H_2$ uses the values of $\Delta D(m_{qi}, m_{rj})$ (molecular structural variations) and $\Delta V(c_{qi}, c_{rj})$ (experimental condition variations) for all (∀) pairs of reactions. First, we calculated the 331 different $ee_R(\%)_{qrcalc}$ values, one per reference reaction, excluding the query itself. Then, we obtained the final value as the average over all the references. These two heuristics are expressed in Eqs. 6 and 7.
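The two heuristics can be sketched as follows. The text does not state how the individual PTO values are aggregated into a single deviation for H1, so the sum of absolute PTO values used here (`total_deviation`) is an assumption made only for illustration; the function names and the numbers are likewise hypothetical.

```python
from statistics import mean


def total_deviation(ptos):
    """Assumed distance: sum of absolute PTO values for one query/reference pair."""
    return sum(abs(p) for p in ptos)


def h1_min_deviation(pairs):
    """H1: prediction obtained with the closest (least perturbed) reference."""
    closest = min(pairs, key=lambda pair: total_deviation(pair[0]))
    return closest[1]


def h2_average(pairs):
    """H2: average of the predictions obtained with every reference."""
    return mean(ee_calc for _, ee_calc in pairs)


# Each pair: (list of PTO values, ee_R(%) calculated with that reference).
# Hypothetical numbers for one query compared with three references.
pairs = [([1.0, -2.0], 80.0), ([0.5, 0.5], 60.0), ([3.0, 0.0], -20.0)]
print(h1_min_deviation(pairs))
print(h2_average(pairs))
```

In the study itself each query has 331 candidate references rather than three, and PTOs on different scales would probably need normalization before being summed.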

Monte Carlo simulation
Most reactivity prediction models already reported take into consideration only the structure of the reactants but omit the values of temperature, catalyst loading, reaction time, solvent polarity, etc., when predicting the enantiomeric excess of the reactions. In fact, many works focus only on yield at specific conditions of T, time, load, etc., and do not predict the enantiomeric excess. In addition, the experimentally measured values of enantiomeric excess, T, time, load, solvent polarity, etc. contain a certain degree of error, because most researchers do not measure them in triplicate or leave them uncontrolled, as when room temperature conditions are used. In this context, the Monte Carlo (MC) simulation starts with the original values of the non-structural variables T, t, and Load and uses a random generator to create new values with small variations with respect to the original ones. MC experiments are a wide-ranging class of computational algorithms that rely on repeated random sampling to obtain numerical results. These methods are among the most useful for data sampling in Cheminformatics [40][41][42].
In this work, we used an MC algorithm to predict the enantiomeric excess of the reactions taking into consideration all these factors, which are of major relevance for optimizing the reaction in the laboratory. To demonstrate the robustness of the model, we generated a new set of reactions with "perturbations" in the values of T (°C), t (h), Load (%), etc., and retrained the models. The values of T (°C), t (h), and Load (%) were changed randomly, but within the limits set by the minimum $V_k(c_{qi})_{min}$ and maximum $V_k(c_{qi})_{max}$ reported for this type of reaction. These synthetic data allow us to test the robustness of the PTML model, that is, its ability to continue giving good predictions despite changes or errors in the reported temperature, time, etc. In addition, the values of the minimum $V_k(c_{qi})_{min}$, maximum $V_k(c_{qi})_{max}$, and step $V_k(c_{qi})_{step}$ were calculated for all the operational conditions (Table 2). Afterwards, we used an MC model based on the system of Eqs. 8 and 9 to create the new synthetic data.

Table 2 Summary of basic statistics for reactions in the dataset
First, Eq. 8 was applied to generate new values $V_k(c_{qi})_{new}$ starting from the original minimum value $V_k(c_{qi})_{min}$. Then, with Eq. 9, we obtained the synthetic value $V_k(c_{qi})_{synth}$ after introducing a boundary condition. This boundary condition reflects the conditions of α-amidoalkylation reactions: it keeps the synthetic values $V_k(c_{qi})_{synth}$ within the range $[V_k(c_{qi})_{min}, V_k(c_{qi})_{max}]$. That is, the synthetic values of the experimental condition variables are equal to the new values only if these are lower than $V_k(c_{qi})_{max}$; otherwise, they are equal to $V_k(c_{qi})_{max}$.

$$V_k(c_{qi})_{new} = V_k(c_{qi})_{min} + \mathrm{Rnd}(0, n_{max}) \cdot V_k(c_{qi})_{step} \qquad (8)$$

$$V_k(c_{qi})_{synth} = \begin{cases} V_k(c_{qi})_{new} & \text{if } V_k(c_{qi})_{new} < V_k(c_{qi})_{max} \\ V_k(c_{qi})_{max} & \text{otherwise} \end{cases} \qquad (9)$$

The function $\mathrm{Rnd}(0, n_{max})$ is a generator of pseudo-random natural numbers ($n = 0, 1, 2, \dots, n_{max}$) based on the Mersenne Twister algorithm (MT19937). The same system of equations was used to generate new synthetic data for the input variables of the reference reaction, $V_k(c_{rj})$.
As mentioned above, we have only made small random changes to the values of the input variables t, T, and catalyst loading with respect to the original ones. Consequently, in the new synthetic data cases generated by MC, we assumed that the deviations (perturbations) of the new input values from the original ones are small enough to cause undetectable/non-measurable changes in the output values of $ee_R(\%)$. This assumption is based on practical empirical evidence, which seems to confirm that new reactions/repetitions carried out with small changes of a few degrees of temperature, minutes of reaction time, or catalyst loading will not alter the value of $ee_R(\%)$ by a measurable amount. In fact, in Eq. 8 the new synthetic value is equal to the minimum value in the whole dataset plus the value of the step multiplied by a random number taking the values 0, 1, 2, …, $n_{max}$.
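A minimal sketch of this sampling scheme, assuming the reading of Eqs. 8 and 9 given in the text (minimum plus a random number of steps, capped at the maximum), is shown below. The limits, the seed, and the function name `synthetic_value` are hypothetical; the real minimum, maximum, and step values are those of Table 2. Python's `random.Random` is used because its core generator is the Mersenne Twister.

```python
import random


def synthetic_value(v_min, v_max, v_step, n_max, rng):
    """Draw one synthetic condition value (sketch of Eqs. 8 and 9)."""
    v_new = v_min + rng.randint(0, n_max) * v_step   # Eq. 8: minimum plus a random number of steps
    return v_new if v_new < v_max else v_max          # Eq. 9: cap at the maximum


# random.Random is CPython's Mersenne Twister (MT19937) generator.
rng = random.Random(2024)  # arbitrary seed, only for reproducibility

# Hypothetical limits for temperature in degrees C (not the Table 2 values).
t_min, t_max, t_step, n_max = -20, 40, 5, 15
samples = [synthetic_value(t_min, t_max, t_step, n_max, rng) for _ in range(1000)]
print(min(samples), max(samples))
```

With these hypothetical limits the uncapped draw can reach 55 °C, so the boundary condition of Eq. 9 is exercised and every sample stays on the step grid between the minimum and the maximum.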

Experimental methods
We describe here the typical procedure for the enantioselective intermolecular α-amidoalkylation reaction leading to the synthesis of (+)-2e (see Table 8, entry 8). For full experimental details and characterization data for

CPA catalyzed α-amidoalkylation reactions chemical space
As stated above, the chemical space of α-amidoalkylation reactions is very wide. In this work, the dataset comprises 332 reactions, which involve 55 different substrates (cyclic and bicyclic hydroxylactams), 53 nucleophiles (enamides, indoles, etc.), 39 chiral catalysts (phosphoric acids, phosphoramides, etc.), and 17 different solvents under multiple experimental conditions (see Supporting Information, file SI00.pdf, for structures and reaction schemes; see Additional file 3 for full details of each reference reaction, including reaction conditions, yield, enantiomeric excess, and SMILES codes for reactants and catalysts in each case). The number of possible combinations of substrates, catalysts, and reaction conditions is potentially too high to be covered by trial-and-error experiments. To illustrate the number of possible combinations: if reactions are run independently by changing one reactant at a time, a total of $N_{comb} = N(Subs) \cdot N(Nuc) \cdot N(Cat) \cdot N(Solv)$ combinations of reactants would have to be tested. On the other hand, there are also important variations in the three main experimental condition variables $V_k(c_{qi})$ [T (°C), t (h), and L (%)]. Table 2 shows different statistical parameters of these variables for the reported reactions. The integer values for the maximum ($T_{max}$, $t_{max}$, and $L_{max}$), minimum ($T_{min}$, $t_{min}$, and $L_{min}$), and step ($T_{step}$, $t_{step}$, and $L_{step}$) are included. This is important because the expression Range[$V_k(c_{qi})$], the difference between the maximum and the minimum, gives the range of each variable that can be covered in actual laboratory practice. Consequently, when this range is divided by the minimum change that we decided to apply in practice, Step[$V_k(c_{qi})$], we obtain the number of experiments $N(c_{qi}) = \mathrm{Range}[V_k(c_{qi})]/\mathrm{Step}[V_k(c_{qi})]$ that can be run to explore this variable. When reactions are run independently by changing one experimental condition at a time, a total of $N_{exp}$ experiments must be run; this is equal to the product of the $N(c_{qi})$ values for T, t, and L (Table 2). The multiplication of both parts of the equation gives an estimate of the very large number of reactions accessible in this chemical space, $N(R_{qi})_{max} = N_{comb} \cdot N_{exp} \approx 10^{11}$. The equation used to calculate the number of reactions in this chemical space is shown below (Eq. 10) [39]:

$$N(R_{qi})_{max} = N_{comb} \cdot N_{exp} = \left[N(Subs) \cdot N(Nuc) \cdot N(Cat) \cdot N(Solv)\right] \cdot \frac{\mathrm{Range}(T)}{\mathrm{Step}(T)} \cdot \frac{\mathrm{Range}(t)}{\mathrm{Step}(t)} \cdot \frac{\mathrm{Range}(L)}{\mathrm{Step}(L)} \approx 10^{11} \qquad (10)$$

The previous calculation gives an idea of the dimension of the chemical reaction space for enantioselective CPA-catalyzed intermolecular α-amidoalkylation reactions. It is not feasible to study all possible combinations in the laboratory due to the time and the cost in material and human resources. In daily practice, chemists can use expert criteria and experimental design techniques to reduce the number of combinations to be tested, to narrow the range of the different experimental condition variables, etc. This can help researchers to reduce meaningfully the number of reactions to perform in practice. However, by relying only on such well-known expert criteria, researchers may never test some interesting products. Therefore, the main objective of this project was the development of a new user-friendly predictive regression model for these reactions. This predictive model may become a useful tool to reduce the time and cost of experimentation.
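The size estimate can be reproduced with a short calculation. The reactant counts (55, 53, 39, 17) are those reported in the text, and taking their product as N_comb is an assumption consistent with the description above; the (min, max, step) triples for T, t, and L are hypothetical placeholders, since the Table 2 values are not reproduced here, so the final product is only illustrative.

```python
from math import prod

# Counts reported in the text: substrates, nucleophiles, catalysts, solvents.
n_comb = prod([55, 53, 39, 17])
print(n_comb)  # 1932645


def n_levels(v_min, v_max, step):
    """Number of experiments for one condition: Range / Step."""
    return (v_max - v_min) // step


def n_exp(limits):
    """Product of the number of levels of every condition variable."""
    return prod(n_levels(*lim) for lim in limits)


# Hypothetical (min, max, step) triples for T, t and L; Table 2 holds the real ones.
hypothetical_limits = [(-80, 100, 5), (1, 97, 4), (1, 21, 1)]
print(n_comb * n_exp(hypothetical_limits))
```

Even with these modest placeholder ranges, the product is many orders of magnitude larger than the 332 reactions actually reported, which is the point the paragraph makes.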

ML linear model for α-amidoalkylation reactions
In α-amidoalkylation reactions, there is no clear relationship between the chirality of the catalyst and the CIP notation of the product. In fact, in our literature dataset of 324 reactions, the catalyst/product chirality combinations are distributed as follows: (R)/(R), 140 reactions (43.2%); (S)/(R), 102 reactions (31.5%); (R)/(S), 72 reactions (22.2%); and (S)/(S), 9 reactions (2.8%). There is only one reaction in the entire dataset with an (S)-configuration catalyst and an enantiomeric excess equal to zero. Therefore, it is very important to have a computational model to predict the absolute stereochemistry and the enantiomeric excess of the reaction product. This type of model could be a useful tool for designing new catalysts and/or selecting the optimal reaction conditions a priori. In this work, we decided to tackle this problem using AI/ML techniques. We trained this classic linear ML model using only the Original Data (OD) from reactions. The equation of this model is shown in Eq. 11 (see Table 3):

$$ee_R(\%)_{qpred} = 912.48 \dots \qquad (11)$$

The model explains 74.0% of the variance (coefficient R² = 0.74), which is an acceptable, although improvable, prediction percentage for organic synthesis reactions. However, the SEE = 37.1 could be considered relatively high [39]. In addition, an essential shortcoming of this classic ML linear model is that it does not provide any evidence about the most similar reactions reported in the scientific literature. Consequently, this may limit our ability to deduce possible mechanisms and/or compare our results with others already known. Therefore, this ML model needs to be used along with a separate search strategy for similar molecules in order to obtain clues about reactions similar to the specific reaction under study. One option is to couple this model with similarity search strategies based on Tanimoto's similarity indices [43]. In fact, there are interesting works that report the coupling of Cheminformatics models with similarity-based search strategies [44][45][46]. A well-known example of an online search tool is the SciFinder platform [47,48].
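For orientation, a Tanimoto-based similarity search of the kind mentioned above can be sketched in plain Python. Fingerprints are represented here simply as sets of "on" bit positions; the fingerprints, the entry names, and the helper `most_similar` are hypothetical, and a real workflow would derive the fingerprints from the SMILES codes with a cheminformatics toolkit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto index between two fingerprints given as sets of 'on' bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def most_similar(query_bits, library, top=3):
    """Rank library entries (name -> bit set) by similarity to the query."""
    ranked = sorted(library.items(),
                    key=lambda item: tanimoto(query_bits, item[1]),
                    reverse=True)
    return [(name, tanimoto(query_bits, bits)) for name, bits in ranked[:top]]


# Hypothetical fingerprints; names are placeholders, not dataset codes.
library = {"ref_1": {1, 2, 3, 4}, "ref_2": {2, 3, 9}, "ref_3": {7, 8}}
print(most_similar({1, 2, 3}, library, top=2))
```

Such a ranking only identifies structurally similar molecules; unlike the PTML approach described next, it does not use the reaction conditions of the retrieved references.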

PTML model for α-amidoalkylation reactions
As mentioned in the previous section, we have reported a PTML model for α-amidoalkylation reactions, although it is difficult to use in practice and is not implemented on a publicly available online web server. Unfortunately, the input variables used in that model are not available as open source code. For this reason, it could be advantageous to implement the model on a public online server. Consequently, we decided to develop a new linear PTML model using our own library to calculate the molecular descriptors. PTML reactivity models can study pair-wise reactions [39]. The model infers the reactivity of a query reaction (q) by comparing it to a previously known reference reaction (r). Some PTML models use different Heuristics (H) to match q and r reactions. These models can be called HPTML models. Figure 1 illustrates the general workflow followed in this work to search for the new HPTML models. In step 1, the reference dataset and reaction pairs q vs. r were created. In step 2, the SMILES codes of the molecules (m qsi , m rsj ) involved in both q and r reactions (substrates, nucleophiles, catalysts, solvents, products) were entered in the MCDCalc server [49] to calculate their molecular descriptors D k (m qsi ) g and D k (m rsj ) g . In step 3, the PTOs for pairs of reactions were calculated. In step 4, the Multivariate Linear Regression (MLR) algorithm implemented in the STATISTICA [38] software was used to seek the PTML model. In step 5, heuristics H 1 and H 2 were tested interactively. In step 6, the best HPTML model was selected. Finally, in step 7, this model was implemented on a public web server (see the following sections). The best linear HPTML model found is shown in Eq. 12.
The HPTML model was trained with a total of n train = 78,732 arbitrarily selected reaction pairs. The statistical parameters obtained for this model in the training series are a regression coefficient of R train = 0.84, a Standard Error of Estimates SEE = 51.67, and a Fisher's ratio of F = 15,238.7 with a p-level < 0.05. This indicates a significant relationship between the observed values ∆ee R (%) qrobs and the predicted values ∆ee R (%) qrcalc .
In addition, another subset of n val = 28,836 reaction pairs was used to validate the model. A regression coefficient R val = 0.77 and SEE = 60.225 were found for this validation series. The output of the model is ee R (%) qrcalc . This variable represents the enantiomeric excess value calculated using a single reference reaction. The ee R (%) calc value quantifies the enantiomeric excess obtained using an (R)-catalyst. If ee R (%) calc > 0, the product is predicted to have the (R) configuration; if ee R (%) calc < 0, the product is predicted to have the (S) configuration; if ee R (%) calc = 0, a racemic mixture is predicted. The overall p-level of the model is p < 0.05. All the variables introduced in the model are statistically significant (Table 4). The first three input variables quantify the effect of non-structural factors on the enantioselectivity parameter, ee R (%) calc . The remaining input variables quantify the contribution of structural variations in the Substrate (Sub), Catalyst (Cat), Product (Prod), Nucleophile (Nuc), and Solvent (Solv).
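The sign convention described above can be written down in a few lines. The following Python sketch maps a calculated ee R (%) value to the predicted outcome; the function name and the optional tolerance `tol` (a band around zero treated as racemic) are assumptions introduced here for illustration and are not part of the model.

```python
def predicted_configuration(ee_r_calc, tol=0.0):
    """Map a calculated ee_R(%) value (for an (R)-catalyst) to an outcome.

    tol is a hypothetical tolerance: values with |ee| <= tol count as racemic.
    """
    if ee_r_calc > tol:
        return "R"
    if ee_r_calc < -tol:
        return "S"
    return "racemic"


# Example values: the two family averages quoted in the text, plus zero.
outcomes = [predicted_configuration(v) for v in (21.0, -78.1, 0.0)]
```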

PTML calculations with a single reference reaction
As we explained above, this PTML reactivity model studies pair-wise reactions. To avoid distortions in the distribution of the variables, the PTML model uses the variable ∆ee R (%) qrobs as objective function (see Eq. 5) [39]. This objective function is the function to fit and is equal to ∆ee R (%) qrobs = ee R (%) qobs − ee R (%) robs . As a result, the output of the new model is ∆ee R (%) qrcalc = ee R (%) qcalc − ee R (%) rcalc . For non-accurate models ∆ee R (%) qrcalc ≠ ∆ee R (%) qrobs (where ≠ indicates not ≈). Conversely, for a non-random, accurate predictor like this one, one can approximate ∆ee R (%) qrcalc ≈ ∆ee R (%) qrobs . This presupposes that ee R (%) qcalc ≈ ee R (%) qobs and ee R (%) rcalc ≈ ee R (%) robs . Therefore, for practical purposes, we use the model to predict the enantiomeric excess of new query reactions, ee R (%) qcalc , based on the observed enantiomeric excess of a reference reaction, ee R (%) robs . The approximation is only valid for non-random, accurate predictors and relies on the fact that ee R (%) robs always comes from a known reference reaction, so that ee R (%) rcalc ≈ ee R (%) robs . It is therefore necessary to rearrange the variables in Eq. 5, that is, ee R (%) qcalc ≈ ee R (%) robs + ∆ee R (%) qrcalc , as shown in Eq. 13. As a result of this approach, the model calculates different values of ee R (%) calcqi for the same reaction depending on the experimental value ee R (%) refj of the reaction used as reference in the pair [39]. Figure 2 illustrates the observed values of Δee R (%) qrobs vs. the predicted (calculated) values of Δee R (%) qrcalc for 10,000 selected reaction pairs. We depict only 10,000 pairs due to software plotting limitations (this is the maximum number of points allowed by the software). A certain linear trend is observed (points with ∆ee R (%) qrcalc ≈ ∆ee R (%) qrobs ). However, despite the adequate goodness of fit of the predictor, there are many points with higher dispersion (points with ∆ee R (%) qrcalc ≠ ∆ee R (%) qrobs ).
$$ee_R(\%) \dots - 1534.17\,\Delta\alpha(\mathrm{Prod}_q,\mathrm{Prod}_r)_{HetNoX} - 215.98\,\Delta EA(\mathrm{Prod}_q,\mathrm{Prod}_r)_{Csat} - 1747.12\,\Delta EA(\mathrm{Cat}_q,\mathrm{Cat}_r)_{HetNoX} - 42.49\,\Delta\chi(\mathrm{Nuc}_q,\mathrm{Nuc}_r)_{Het} + 750.76\,\Delta\chi(\mathrm{Cat}_q,\mathrm{Cat}_r)_{HetNoX} - 34.19\,\Delta V(\mathrm{Sub}_q,\mathrm{Sub}_r)_{Tot} + 22.04\,\Delta Zv(\mathrm{Cat}_q,\mathrm{Cat}_r)_{Cuns} - 12.46\,\Delta Zv(\mathrm{Solv}_q,\mathrm{Solv}_r)_{Cuns} - 0.91 \qquad (13)$$
In fact, PTML models may be included in a broader class of learning problems, such as delta ML, transfer ML, template selection ML, etc. [50][51][52][53]. In general, these models involve the use of a query item (item to be predicted) compared to a reference item (template, pair, known case, item from related domain, etc.). To calculate the output of a query item (quantum field, drug, protein, or reaction in this case), it is necessary to use an already known item or population of reference items as input. Query items can be in the same or a different data domain from the reference item. In this context, the low population (low number of available cases) of some of the studied data subsets (data domains) is also a common problem. In our case, to calculate the value of ee R (%) calcqi for a query reaction (q), the observed ee R (%) refj value of an already known reference reaction (r) must be used as input. Here both the query and reference items come from the same data domain (both are the same type of reaction). The reference reaction can be selected from our reaction dataset (same data domain) [54]. Consequently, for a new query reaction, there are n = 332 reactions in the dataset that can be used as the reference reaction, which raises the question of which is/are the best candidate(s) to be used as reference reaction in each case (see next section). Thus, 332 different values of ee R (%) calcqi can be calculated for the same query reaction depending on the reference reaction selected for the pair. In this step, heuristic rules can be used to approximate the final predicted value ee R (%) qpred from the ee R (%) calc values of the model, as we have demonstrated previously to solve a similar problem [39].
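The dependence of the calculated value on the chosen reference can be made concrete with a small Python sketch. It assumes a generic linear perturbation model of the form Δee = intercept + Σ b_k·(x_qk − x_rk) and then applies ee_q,calc ≈ ee_r,obs + Δee. The coefficients, the variable names `T` and `descriptor`, and the two reference reactions are made up for illustration; they are not the terms or data of the published model.

```python
def delta_ee_calc(query, reference, coefficients, intercept):
    """Linear perturbation model: intercept + sum_k b_k * (x_qk - x_rk)."""
    return intercept + sum(
        b * (query[name] - reference[name]) for name, b in coefficients.items()
    )


def ee_q_calc(query, reference, coefficients, intercept):
    """Query ee estimated from the observed ee of one reference reaction."""
    delta = delta_ee_calc(query, reference, coefficients, intercept)
    return reference["ee_obs"] + delta


# Hypothetical coefficients and reactions (NOT those of the published model).
coefficients = {"T": -0.5, "descriptor": 10.0}
intercept = 0.0
query = {"T": 25.0, "descriptor": 1.2}
references = [
    {"id": "r1", "T": 25.0, "descriptor": 1.0, "ee_obs": 80.0},
    {"id": "r2", "T": 0.0, "descriptor": 2.0, "ee_obs": -40.0},
]

# One calculated value per reference: the same query gets different answers.
predictions = {
    r["id"]: ee_q_calc(query, r, coefficients, intercept) for r in references
}
```

With n reference reactions this loop yields n candidate values for a single query, which is why a selection rule for the reference is needed.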

HPTML model for prediction with multiple reactions of reference
As mentioned above, it is necessary to define the best reaction or set of reactions to use as reference. Defining an appropriate reference reaction can also help to reduce the dispersion and increase the value of the regression coefficient, because each query reaction will have a single predicted value. For this purpose, a Heuristic rule coupled to the PTML model can be used to select the best reference. Heuristic-based methods have been widely used in Cheminformatics to solve practical problems [55][56][57]. In our case, the combination of the PTML model with a Heuristic (H) rule defines the HPTML = H + PTML algorithm. Two Heuristics (H 1 and H 2 ) were tested by calculating the ee R (%) qrpred values for the 332 reactions in our dataset, using the PTML model trained with the OD set. These HPTML models based on Heuristics H 1 and H 2 were compared with a classic ML model. This classic ML model includes no PT terms and was built without using Heuristics (H 0 ). Figure 3 shows a schematic illustration of the ML, PTML, and HPTML data re-arrangement, as well as the MC data enrichment procedures used here. Table 5 shows the statistical parameters for these studies (see only entries with Data = OD). Detailed information can be found in Table S1 of the Supporting Information (Additional file 2). It should be noted that both HPTML models using Heuristics give good results with an OD regression coefficient in the range R2 = 0.64-0.81 and p < 0.05. Specifically, the HPTML OD H 1 model has a higher regression coefficient (R2 = 0.81 vs. 0.55) and a lower SEE (SEE = 29.5 vs. 37.1) than the classic ML model. However, this SEE value is still relatively high. Interestingly, MC data enrichment improved both the R2 = 0.96 and SEE = 13.5 values of the HPTML OD H 1 model. In addition, the HPTML model automatically provides the most similar reference reaction from the reference dataset, including the reference of the article, which might give some clues about the possible reaction mechanism of the query reaction. In contrast, the classic ML model does not give information about the plausible reaction mechanism or similar reactions in the literature. Overall, these results justify the use of the HPTML algorithm instead of the classic ML algorithm.
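The role of a heuristic, collapsing many pair-wise values into one predicted value per query, can be sketched as follows. The definitions of H 1 and H 2 are not given in this section, so the rule used below (choose the reference closest to the query in a descriptor space, by Euclidean distance) is only a stand-in assumption and should not be read as the authors' heuristic. All names and numbers are hypothetical.

```python
import math


def select_reference(query_desc, references):
    """Stand-in heuristic: pick the reference nearest to the query.

    Distance is Euclidean in a (hypothetical) descriptor space.
    """
    return min(references, key=lambda r: math.dist(query_desc, r["desc"]))


# Hypothetical references, each carrying the ee value that a pair-wise
# model calculated for the query when paired with that reference.
references = [
    {"id": "r1", "desc": (1.0, 0.5), "ee_calc_for_query": 82.0},
    {"id": "r2", "desc": (3.0, 2.0), "ee_calc_for_query": -60.5},
]
query_desc = (1.2, 0.4)

best = select_reference(query_desc, references)
ee_pred = best["ee_calc_for_query"]  # single predicted value for the query
```

Whatever rule is used, the selected reference is also reported back, which is what lets the approach point to the most similar known reaction.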
Interestingly, the pair-wise strategy can rapidly increase the number of cases, as one goes from datasets with n items (reactions) to n × n items (pairs of reactions). In this case, we go from n reacc = 332 reactions to n pairs = 107,626 pairs of reactions, which could be an advantage of the PTML model, since increasing the number of items used to train the ML model can improve learning. However, those items that are underrepresented in the original data are still underrepresented in the new data in relative terms. In addition, there is a risk of including mismatched pairs, that is, of trying to predict an underrepresented query item (reaction) using as reference an overrepresented item (reaction family) that is not similar to the query. For example, reactions from the aaa family are the most represented, with n reacc = 120 cases (36.14% of cases) and n pairs = 37,570 (34.91%), including many pairs with reactions from the same family. In contrast, reactions from the dab family are very poorly represented (low abundance), with only n reacc = 3 cases (0.9% of cases) appearing in n pairs = 995 pairs of reactions. Almost all of these pairs are formed with reactions from other families, and the relative abundance remains low (0.9%). Table 6 shows the absolute and relative abundance of different reaction families (subsets) in the original dataset and the number of pairs formed with them. It should be noted that the formation of pairs of mismatched reactions can lead to inaccurate predictions. For example, predicting a query reaction from the aab family may give an inaccurate result if we use a reaction from the haa family as reference, because aab reactions have an average enantiomeric excess < ee R (%) > qobs = 21.0 while haa reactions have < ee R (%) > qobs = -78.1. Both reaction families not only have a markedly different average enantiomeric excess, but also give products with opposite (R) or (S) CIP notation of absolute configuration [31]. The compound codes, SMILES codes, and chemical structures of the different families of substrates, nucleophiles, and catalysts are shown in the Supporting Information file SI00.pdf.
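The relative abundances quoted above follow directly from the reported counts. The short Python sketch below recomputes them for the aaa and dab families; only the counts given in the text are used, and the helper name `relative_abundance` is introduced here for illustration.

```python
def relative_abundance(count, total):
    """Percentage of `count` within `total`."""
    return 100.0 * count / total


n_reacc_total = 332       # reactions in the dataset (from the text)
n_pairs_total = 107_626   # pairs of reactions (from the text)

families = {
    "aaa": {"n_reacc": 120, "n_pairs": 37_570},
    "dab": {"n_reacc": 3, "n_pairs": 995},
}

abundance = {
    name: {
        "reactions_pct": relative_abundance(c["n_reacc"], n_reacc_total),
        "pairs_pct": relative_abundance(c["n_pairs"], n_pairs_total),
    }
    for name, c in families.items()
}
```

Pairing multiplies the absolute counts, but the percentages stay nearly the same, which is the point made about underrepresented families.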
In this regard, synthetic data generation techniques can be used to mitigate the presence of sparsely populated data subsets. In any case, the total abundance of each enriched data subset should remain essentially constant to avoid creating data artifacts. MC sampling methods have been widely used in chemistry for similar purposes [58]. To mitigate this situation, we have used a Mersenne-Twister MC algorithm (MT19937) [59] for data enrichment by creating new synthetic data. Therefore, synthetic data cases of the input variables V k (c qi ) = T(°C) qi , t(h) qi , or L(%) qi of query reactions were generated using an MC algorithm (see system of equations in the Materials and Methods section). The same MC algorithm (system of equations) was used to generate new synthetic data for the input variables V k (c rj ) of the reference reactions. Nevertheless, the molecular descriptors D k (m sqi ) and D k (m srj ) were never modified in the MC data enrichment simulation, because one can reasonably expect that small changes in the input reaction condition variables [V k (c qi ) = T(°C), t(h), or L(%)] do not significantly change the output ee R (%). However, the same cannot be guaranteed for changes in chemical structure. Thus, we obtained a slightly higher number of cases for very low abundant reactions. For example, we were able to add n mcpairs = 15, 20, or 40 new cases for the dab, aab, and eab families of reactions, but we kept their relative abundance essentially low, in the range 0.9-2.47%.
Fig. 4 HPTML ee R (%) observed vs. predicted values with Eq. 12 (R2 = 0.98) after applying both MC synthetic data and best Heuristic rule (ODMC + H 1 ). Overall data for training and validation series. The reaction number from the database (see Additional file 2) has been included for selected examples
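A minimal sketch of this kind of enrichment is shown below. Python's `random.Random` uses the Mersenne Twister as its core generator, so it is a convenient stand-in for an MT19937-based sampler. The perturbation rule (a small relative uniform noise on T, t, and L) is an assumption made for illustration; the actual system of equations is the one given in the Materials and Methods section. The base reaction values, including T = 25 for room temperature, are likewise illustrative.

```python
import random


def enrich(reaction, n_new, rel_noise=0.05, seed=0):
    """Create n_new synthetic copies with slightly perturbed conditions.

    Only the condition variables T, t, and L are perturbed (assumed rule:
    relative uniform noise); descriptors and ee are copied unchanged.
    Note that a multiplicative rule leaves a value of exactly 0 unchanged.
    """
    rng = random.Random(seed)  # Mersenne Twister core generator
    synthetic = []
    for _ in range(n_new):
        new_case = dict(reaction)
        for key in ("T", "t", "L"):
            factor = 1.0 + rng.uniform(-rel_noise, rel_noise)
            new_case[key] = reaction[key] * factor
        synthetic.append(new_case)
    return synthetic


# Illustrative base case: T in deg C, t in h, L in mol%, one descriptor.
base = {"T": 25.0, "t": 24.0, "L": 20.0, "descriptor": 1.2, "ee_obs": 90.0}
new_cases = enrich(base, n_new=15)
```

Fixing the seed makes the enrichment reproducible, which helps when comparing models trained on the original and the enriched data.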
Table 5 shows that both models trained with the ODMC dataset (OD enriched by MC) give essentially the same value of R = 0.8-0.9 and p < 0.05 obtained with OD alone. However, the error decreased from SEE = 29.5% to SEE = 13.5% using Heuristic H 1 . Table 7 shows the correlation matrix for the outputs of all models, which illustrates the high correlation obtained among them, R = 0.80-0.99. The results of ee R (%) qrobs observed vs. ee R (%) qrpred predicted with this HPTML model using the ODMC dataset and the H 1 heuristic are graphically depicted in Fig. 4, where each point corresponds to a reaction included in the dataset. Although an excellent correlation between the predicted and observed ee(%) values is generally obtained, some values are far from the line of correlation. In selected cases, the corresponding reaction number from the database (see SI001.xls file) has been included. It is difficult to draw any conclusions from these cases, as the reactants used are structurally heterogeneous and the experimental conditions are diverse as well. In any case, the model already has a very high R2 = 0.98 value. We can conclude that using ODMC enriched data decreased the error of the model without decreasing the regression quality.

HPTML vs. Experimental study of new reactions
In this section, we report an additional test of the HPTML model, comparing the computational predictions with the experimental study of new reactions. Thus, we performed both an experimental and a theoretical study of new intermolecular α-amidoalkylation reactions not previously reported in the literature. First, the α-amidoalkylation reactions carried out experimentally are described. Next, we report the use of the HPTML model to predict these reactions and compare the results with the experimental values.

Experimental study of α-amidoalkylation reactions
As stated above, the α-amidoalkylation reaction is a very attractive method for C-C bond formation in organic synthesis. In this context, we have previously reported [27] that the α-amidoalkylation reaction is an efficient procedure for the enantioselective synthesis of 12b-substituted isoindoloisoquinolines (Nuevamine-type alkaloids [60]) using BINOL-derived Brønsted acids as catalysts. It should be pointed out that these catalysts have been used in the intermolecular α-amidoalkylation of indoles with cyclic N-acyliminium ions formed in situ from cyclic hydroxylactams to form tertiary or quaternary stereogenic centers, but this was the first example of bicyclic N-acyliminium intermediates in intermolecular α-amidoalkylation reactions of indoles [30]. The best results were obtained using a sterically demanding CPA (20 mol% catalyst loading) under the following conditions: THF as solvent at room temperature for 24 h. However, in some cases, moderate enantioselectivity (enantiomeric excess) and/or yields were obtained. Therefore, we decided to test BINOL-derived N-triflylphosphoramides as catalysts to enhance the enantioselectivity of these reactions, because they are known to have an increased acidity when compared to the corresponding CPAs, so they can form tighter ion pairs leading to an improved reactivity [61,62]. Thus, the N-triflylphosphoramides 4a-d were synthesized [63,64] and tested as catalysts in the reaction of 12b-hydroxyisoindoloisoquinoline 1 with the indoles 3a-d (Scheme 4). Table 8 summarizes these new results compared with those previously obtained with phosphoric acid 5e, which has been demonstrated to be the most efficient catalyst for indole [30]. The best results were obtained with the catalyst 4a, although good to excellent yields were achieved with all the phosphoramides. We were thus able to improve on the results previously obtained with the corresponding phosphoric acids, obtaining 2a with excellent yield and enantioselectivity (90, 93% ee).
In addition, the intermolecular α-amidoalkylation reaction was extended to 5-substituted indoles 3b-d, obtaining excellent yields, even when a strong acceptor group (NO 2 ) was introduced (Table 8, entry 5). However, the use of the substituted indoles led to lower enantiomeric excesses (Table 8, entries 5-7). The reaction could also be applied to other electron-rich heteroaromatics such as pyrrole 3e, obtaining 2e quantitatively, although with moderate ee (Table 8, entry 8). In this case, the reaction was cleaner and faster (reaction completed in 5 h) than when using phosphoric acid 5e as catalyst (Table 8, entries 13-15).

HPTML prediction of new α-amidoalkylation reactions
Next, using the developed HPTML ODMC H 1 model, we predicted the values of ee R (%) for the new enantioselective intermolecular α-amidoalkylation reactions. We first calculated the molecular descriptors D k (m qsi ) g of all the molecules (Substrate qi , Nucleophile qi , Catalyst qi , Solvent qi , and Product qi ) involved in the new query reactions (R q ) using the web server MCDCalc [49]. Then, the Heuristic H 1 was used to find the best reference reaction for each new query reaction. Next, we substituted into the model equation the values of the molecular descriptors D k (m qsi ) g and D k (m rsj ) g of the molecules, as well as the values of the input experimental condition variables V k (c qi ) and V k (c rj ), from the query (R q ) and reference (R r ) reactions, respectively. Table 9 shows the predicted ee R (%) values for each reaction compared to the values predicted with the other Datasets (OD vs. ODMC) and Heuristics (H 1 and H 2 ). The other HPTML models have notably larger residual values, confirming our decision to discard them as good predictors for this type of reaction. In general, the best results are obtained with the HPTML ODMC H 1 model. For a total of 6 out of 8 reactions, the model almost perfectly predicts the observed values of ee R (%) qrobs , with residual values in the range ee R (%) qrres = −1.1 to 1.9% (reactions 1, 2, 5-8) (Table 9). The experimental and predicted values for the preparation of 2a-e using catalyst 4a are represented in Scheme 5. For the other two reactions, the model correctly predicts the absolute stereochemistry of the final products, although with a relatively higher error. In addition to the results of the training and validation series, these results validate the HPTML ODMC H 1 model as a useful predictor for enantioselective intermolecular α-amidoalkylation reactions. Microsoft Excel was used to run all these calculations. However, this HPTML calculation procedure is slow because it is not automatic and needs more than one software application (MCDCalc, Excel) to run. Furthermore, the model is not available for use by other groups and requires some degree of expertise in Cheminformatics, so we decided to implement it on a public web server.

MATEO web server
The HPTML model was implemented on a new public web server called MATEO: interMolecular Amidoalkylation Theoretical Enantioselectivity Optimization. The MATEO server is available for public use online (free of charge) at https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo. The graphical interface of the web server is shown in Fig. 5. Users worldwide can upload their own sets of query reactions to predict the values of ee R (%) qrcalc under different experimental conditions (solvent, time, temperature, catalyst loading), see Table 10.
Figure 6 graphically illustrates (from bottom to top) the steps required to use this web server. Step 1 is to upload the chemical structures of all the molecules involved in the reaction. The server requires the structures to be uploaded in the Simplified Molecular Input Line Entry Specification (SMILES) code format [65]. SMILES has become a simplified and memory-efficient way of managing molecular structures that is widely used in Cheminformatics today [66,67]. These codes can be pasted directly into the web interface or uploaded as a text file. The server allows uploading large collections of reactions with different combinations of substrate, nucleophile, and catalyst. This could be useful for exploring large libraries of molecules (products, substrates, and nucleophiles) and/or for the design of new catalysts. The server also allows uploading of the solvent structure, making it easy to explore a large variety of solvents. In Step 2, three general types of calculations can be selected: (1) Similarity Search, (2) Structural Scan, or (3) Conditions Scan. Option (1) predicts the enantiomeric excess values and also provides a report of the most similar reactions from the references in our dataset. Option (2) allows uploading the specific structures (substrate, nucleophile, catalyst, and/or solvent) and running a scan of these molecules under reaction conditions similar to those reported in the literature. Option (3) keeps the structural parameters constant (same molecules), while the software performs a scan of different combinations of input variables (temperature, time, catalyst loading). Table 10 shows the range (minimum, maximum) and step of the variables allowed by the server.
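The idea behind a conditions scan, enumerating all combinations of temperature, time, and catalyst loading over a (minimum, maximum, step) grid while the molecules stay fixed, can be sketched in Python as follows. This is not the server's code; the function names and the numeric ranges are hypothetical and do not reproduce the limits listed in Table 10.

```python
from itertools import product


def scan_values(minimum, maximum, step):
    """Evenly spaced values from minimum to maximum (inclusive)."""
    n_steps = int(round((maximum - minimum) / step))
    return [minimum + i * step for i in range(n_steps + 1)]


def conditions_grid(ranges):
    """All combinations of the scanned condition variables."""
    names = list(ranges)
    axes = [scan_values(*ranges[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*axes)]


# Hypothetical (min, max, step) settings: T in deg C, t in h, L in mol%.
ranges = {
    "T": (0.0, 40.0, 10.0),
    "t": (12.0, 24.0, 6.0),
    "L": (5.0, 20.0, 5.0),
}
grid = conditions_grid(ranges)  # 5 * 3 * 4 combinations of conditions
```

Each dictionary in `grid` would then be combined with the fixed molecular descriptors and passed to the predictive model, so the cost of a scan grows with the product of the number of values per variable.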
In this context, Goodman et al. have recently developed a rule-based web tool, BINOPtimal, for the online selection of CPA catalysts in a related reaction, the addition of nucleophiles to imines, by analyzing the reagent structures [68]. The MATEO web server allows the user to make quantitative predictions of the enantiomeric excess parameter ee R (%) at different reaction temperatures, times, catalyst loadings, or solvent polarities, which are known factors that affect the enantioselectivity of α-amidoalkylation reactions. Therefore, the MATEO web server will be useful to guide not only the catalyst selection but also the choice of experimental conditions.

Conclusions
In conclusion, we have shown that classic linear ML models are not very accurate in predicting the enantioselectivity of α-amidoalkylation reactions using physicochemical properties calculated with a Markov chain approach as input. Besides, these linear ML models do not allow the most similar reaction to be detected directly from the model. The PTML algorithm outperforms the classic linear ML model using the same dataset and molecular descriptors. Moreover, the HPTML algorithm, based on the PTML model plus a heuristic rule, allows direct detection of the most similar reference reactions. In addition, MC synthetic data re-sampling/enrichment procedures reduce the error of the model. The final HPTML model performs very well in computational experiments with validation series. The HPTML model also reproduces very well the experimental values of a new series of reactions studied experimentally for the first time in this work. Finally, the implementation of the HPTML model on the MATEO online server makes the algorithm available for public use worldwide with a user-friendly interface.

a Stat. = Statistical parameters for the input parameters (operational conditions) of all the reactions present in our dataset: N reacc = number of reactions present in our dataset, Avg. = average value, S.D. = standard deviation, Max. = maximum value, Min. = minimum value, Range = Max. − Min., Step = minimal change allowed in one experimental condition, N expr. = number of experiments (reactions) changing one condition and keeping the others constant
b Operational conditions: T(°C) = temperature, t(h) = reaction time, Load(%) = catalyst loading
Stat. a Dataset reaction conditions (c qi ) b

… N(Subs qi )•N(Nuc qi )•N(Cat qi )•N(Solv qi ) = 55•53•39•17 = 1,932,645 unique combinations of molecule subtypes should be run. This could be a new source of interesting products [changes in N(Subs qi ) or N(Nuc qi )] or a way to improve the reaction efficiency [changes in N(Cat qi ) or N(Solv qi )]. This estimation considers only the combinations of different molecular entities. Unfortunately, the vast majority of these reactions remain unexplored because of the high cost in time and resources.

N exp = N(c 1 )•N(c 2 )•N(c 3 ) = N(T)•N(t)•N(L) = [Range(T)/Step(T)]•[Range(t)/Step(t)]•[Range(L)/Step(L)] = [144/10]•[239/1]•[28/1] = 96,365 optimization experiments for each unique combination of molecule sub-types, giving as a result a specific Product qi of the reactions R qi (Table …

This ML model does not use reference reactions for comparison. The statistical parameters of the model are n = 332, regression coefficient R2 = 0.74, Fisher ratio F = 59.2, Standard Error of Estimates SEE = 37.1, p-level p < 0.05. More detailed information about coefficients and variables of the model, as well as symbols and names of variables, Standard Error (SE), Student's t values, and p-level, are given in Table …

Fig. 1 HPTML models general workflow used in this work

Fig. 3 HPTML data re-arrangement and MC data enrichment schematic illustration

Table 3
Results of the PTML regression model
a Input variables with coefficient b k are the values of shift (Δ) in q-reac vs. r-reac for different properties: α = average atomic polarizability, EA = average atomic Electron Affinity, χ = average atomic Sanderson Electronegativity, Zv = average atomic number
b Coefficients of the variables in the model; the output variable is the Δ in enantiomeric excess ee(%)* of the q-reac with respect to the r-reac when both reactions have been carried out with (R)-catalyst
c Standard error of the coefficients
d Student t-value
e p-level of error

Table 4
Results of the PTML regression model
a Input variables with coefficient b k are the values of shift (Δ) in q-reac vs. r-reac for different properties: α = average atomic polarizability, EA = average atomic Electron Affinity, χ = average atomic Sanderson Electronegativity, Zv = average atomic number
b Coefficients of the variables in the model; the output variable is the Δ in enantiomeric excess ee(%)* of the q-reac with respect to the r-reac when both reactions have been carried out with (R)-catalyst
c Standard error of the coefficients
d Student t-value
e p-level of error

Fig. 2 Observed vs. Predicted (Δee R (%) qrobs vs. Δee R (%) qrcalc ) for Eq. 12 (R = 0.84 in training series). Only 10,000 pairs of reactions (cases) are depicted due to software limitations

Table 5
HPTML models obtained with different datasets vs. alternative heuristics
a OD = Original Data, MC = Monte Carlo, ODMC = OD + MC enriched dataset
b n reacc = number of reactions present in our dataset
c n pairs = number of pairs of reactions present in our dataset

Table 6
Selected subsets of reactions
a Sub = Substrate, Nuc = Nucleophile, Cat = Catalyst; patterns a, b, c, aaa, etc. are the different families of reactants/reactions (see the text); ND = no data
b OD = Original Data, MC = Monte Carlo, ODMC = OD + MC enriched dataset; n reacc = number of reactions present in our dataset; n pairs = number of pairs of reactions present in our dataset; n mcpairs = number of pairs of reactions present in our dataset in MC experiments

Table 7
HPTML Data set vs. heuristics correlation matrix
a OD = Original Data, MC = Monte Carlo, ODMC = OD + MC enriched dataset

Table 8
Enantioselective intermolecular α-amidoalkylation reactions of N-triflamides vs. phosphoric acids as catalysts
a Yield (%) of isolated pure compound; the symbols 2a-5e are the reactants and products, see Scheme 4

Table 10
MATEO Web server operational conditions
a Stat. = Statistical parameters for the input parameters (operational conditions) of all the reactions present in our dataset: Max. = maximum value, Min. = minimum value, Step = minimal change allowed in one experimental condition
b Operational conditions: T(°C) = temperature, t(h) = reaction time, Load(%) = catalyst loading

MATEO server use workflow